Problem Set 1 Solutions

STAT 436, Fall 2024

NYC Flights

Scoring

  • a, Code (0.5 points): Correct and concise (0.5 points), Correct but unnecessarily complex (0.25 points), Incorrect (0 points)
  • b, Design (1 point): Clear and customized visual design (1 point), Technically correct but with some inattention to detail (0.5 points), Incorrect (0 points)
  • b, Discussion (0.5 points): Complete interpretation (0.5 points), Incomplete or unjustified interpretation (0.25 points), Incorrect (0 points)

Question

The following questions refer to the NYC flights dataset. The first few lines are printed below.

library(nycflights13)
flights |> 
    select(carrier, air_time, distance)
## # A tibble: 336,776 × 3
##    carrier air_time distance
##    <chr>      <dbl>    <dbl>
##  1 UA           227     1400
##  2 UA           227     1416
##  3 AA           160     1089
##  4 B6           183     1576
##  5 DL           116      762
##  6 UA           150      719
##  7 B6           158     1065
##  8 EV            53      229
##  9 B6           140      944
## 10 AA           138      733
## # ℹ 336,766 more rows
  a. Provide code to create a new column giving the average speed of the flight: \(\texttt{speed} := \frac{\texttt{distance}}{\texttt{air_time}}\).
flights |>
    mutate(speed = distance / air_time)
## # A tibble: 336,776 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute time_hour           speed
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl> <dttm>              <dbl>
##  1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15 2013-01-01 05:00:00  6.17
##  2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29 2013-01-01 05:00:00  6.24
##  3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40 2013-01-01 05:00:00  6.81
##  4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45 2013-01-01 05:00:00  8.61
##  5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0 2013-01-01 06:00:00  6.57
##  6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150      719     5     58 2013-01-01 05:00:00  4.79
##  7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL        158     1065     6      0 2013-01-01 06:00:00  6.74
##  8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0 2013-01-01 06:00:00  4.32
##  9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO        140      944     6      0 2013-01-01 06:00:00  6.74
## 10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0 2013-01-01 06:00:00  5.31
## # ℹ 336,766 more rows
  b. Is there a large variation in flight speed across carriers? Design and sketch code for a visualization that could be used to answer this question (you may assume the output of (a)).

We provide three potential solutions at varying levels of granularity.

flights |>
    mutate(speed = distance / air_time) |>
    ggplot() +
    geom_boxplot(aes(speed, reorder(carrier, speed, median, na.rm = TRUE))) +
    labs(x = "Average Flight Speed (Miles/Minute)", y = "Carrier")

flights |>
    mutate(speed = distance / air_time) |>
    ggplot() +
    geom_point(
        aes(speed, reorder(carrier, speed, median, na.rm = TRUE)),
        alpha = 0.05, size = 0.5, position = position_jitter(h = 0.1)
    ) +
    labs(x = "Average Flight Speed (Miles/Minute)", y = "Carrier")

flights |>
    mutate(speed = distance / air_time) |>
    ggplot() +
    geom_histogram(aes(speed)) +
    facet_wrap(~ reorder(carrier, speed, mean, na.rm = TRUE), ncol = 8) +
    labs(x = "Average Flight Speed (Miles/Minute)", y = "Frequency")

London Olympics

Scoring

  • c, Design (0.5 points): Clear and customized visual design (0.5 points), Technically correct but with some inattention to detail (0.25 points), Inappropriate visual encodings (0 points)
  • c, Discussion (0.5 points): Correct and concise (0.5 points), Incomplete or unjustified interpretation (0.25 points), Incorrect (0 points)
  • a - b, Code (0.5 points): Correct and concise (0.5 points), Correct but unnecessarily complex (0.25 points), Incorrect (0 points)

Question

The data at this link describe all participants in the London 2012 Olympics.

  a. Create a layered display that shows (i) the ages of athletes across sports and (ii) the average age within each sport.
olympics <- read_csv("https://uwmadison.box.com/shared/static/rzw8h2x6dp5693gdbpgxaf2koqijo12l.csv")
averages <- olympics |>
    group_by(Sport) |>
    summarise(Age = mean(Age))

ggplot(olympics, aes(Age, Sport)) +
    geom_point(position = position_jitter(h = .2), size = 0.5, col = "#63CAF2") +
    geom_point(data = averages, col = "#184059")

  b. Sort the sports from lowest to highest average age.
ggplot(olympics, aes(Age, reorder(Sport, Age))) +
    geom_point(position = position_jitter(h = .2), size = 0.5, col = "#63CAF2") +
    geom_point(data = averages, col = "#184059") +
    labs(x = "Age", y = "Sport")

  c. Develop one new question based on these data. What makes you interested in it? Provide a visualization that supports the comparisons needed to arrive at an answer.

There are many possible solutions to this problem. Some potential questions of interest include,

  • Which countries win the most medals overall?
  • Which countries win the most medals in which sports?
  • Which athletes won the most medals?
  • How does athlete age vary across both sport and gender?
  • How many athletes were born outside of the country that they competed for?

Here is an example design for the first question. Some interesting findings include: (i) some countries have much larger (or smaller) proportions of Gold medals, in spite of lower overall medal counts (e.g., Germany and Canada), and (ii) there is a long tail of countries with 1–2 medals. We could imagine faceting by a few of the major sports (using fct_lump to group the rare ones), but we would want to reorder separately within each facet (this will be covered in the module on text data).

olympics |>
    group_by(Country) |>
    summarise(across(Gold:Bronze, sum)) |>
    pivot_longer(-Country, names_to = "Medal") |>
    filter(value > 0) |>
    mutate(Medal = factor(Medal, levels = c("Bronze", "Silver", "Gold"))) |>
    ggplot() +
    geom_col(
        aes(value, reorder(Country, value, sum), fill = Medal),
        width = 1
    ) +
    scale_fill_manual(values = c("#cd7f32", "#c0c0c0", "#ffd700")) +
    scale_x_continuous(expand = c(0, 0, 0.1, 0)) +
    labs(x = "Medal Count", y = "Country")
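The fct_lump idea mentioned above could be sketched as follows. This is a hypothetical extension rather than part of the required solution, and note that reorder here still sorts countries by their overall totals rather than within each panel.

```r
library(tidyverse)

# Hypothetical extension (not required for credit): lump all but the five most
# common sports into an "Other" level, then facet the medal counts by sport.
olympics <- read_csv("https://uwmadison.box.com/shared/static/rzw8h2x6dp5693gdbpgxaf2koqijo12l.csv")

olympics |>
    mutate(Sport = fct_lump_n(Sport, n = 5)) |>
    group_by(Country, Sport) |>
    summarise(total = sum(Gold + Silver + Bronze), .groups = "drop") |>
    filter(total > 0) |>
    ggplot() +
    geom_col(aes(total, reorder(Country, total, sum))) +
    # Countries share one global ordering across panels; per-facet
    # reordering is covered later in the course.
    facet_wrap(~ Sport) +
    labs(x = "Medal Count", y = "Country")
```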

Pokemon

Scoring

  • c, Design (1 point): Creative and clear visual design (1 point), Technically correct but with some inattention to detail (0.5 points), Inappropriate visual encodings (0 points)
  • a - c, Code (2 points): Correct and concise (2 points), Correct but unnecessarily complex (1 point), Incorrect (0 points)
  • d, Discussion (1 point): Proposed designs with full elaboration (1 point), Reasonable but not fully developed proposal (0.5 points), Misunderstood or no proposal (0 points).

Question

This problem gives practice in deriving new variables to improve a faceted plot. The data below give attack and defense statistics for Pokemon, along with their types. We will build a visualization to answer the question: how do the different types of Pokemon vary in their attack and defense potential?

  a. Derive a new column containing the attack-to-defense ratio, defined as \(\frac{\text{Attack}}{\text{Defense}}\).
pokemon <- read_csv("https://uwmadison.box.com/shared/static/hf5cmx3ew3ch0v6t0c2x56838er1lt2c.csv") |>
    mutate(attack_to_defense = Attack / Defense)
  b. For each type_1 group of Pokemon, compute the median attack-to-defense ratio.
group_ratio <- pokemon |>
    group_by(type_1) |>
    summarise(group_ratio = median(attack_to_defense)) |>
    arrange(-group_ratio)
  c. Plot the attack vs. defense scores for each Pokemon, faceted by type_1. Use the result of (b) to ensure that the panels are sorted from types with highest to lowest attack-to-defense ratio.
pokemon |>
    ggplot(aes(x = Attack, y = Defense, col = Legendary)) +
    geom_abline(col = "#d3d3d3") +
    geom_point(size = 0.8) +
    scale_color_manual(values = c("#89C893", "#B74555")) +
    guides(color = guide_legend(override.aes = list(size = 6))) +
    facet_wrap(~ reorder(type_1, Attack / Defense, median), ncol = 6) +
    theme(
        panel.grid.minor = element_blank(),
        legend.position = "bottom"
    )

  d. Propose, but do not implement, a visualization of this dataset that makes use of dynamic queries. What questions would the visualization answer? What would be the structure of interaction, and how would the display update when the user provides a cue?

A variety of answers could be provided for this problem. Some potential query / inputs that could be supported include,

  • Allow users to select the type_1 group of Pokemon from a dropdown menu, so that the data do not need to be faceted.
  • Create a histogram of attack-to-defense ratios, allowing users to graphically query samples with especially low or high ratios.
  • Create scatterplots of other features, like speed or HP, and link graphical queries on this scatterplot with the original plot above (or with a table).
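Though the problem asks only for a proposal, the first idea could be sketched with shiny as follows; the input and output names here are hypothetical.

```r
library(shiny)
library(tidyverse)

# Hypothetical sketch of the dropdown query: selecting a type_1 group
# replaces the faceted display with a single filtered scatterplot.
pokemon <- read_csv("https://uwmadison.box.com/shared/static/hf5cmx3ew3ch0v6t0c2x56838er1lt2c.csv")

ui <- fluidPage(
    selectInput("type", "Pokemon Type", choices = sort(unique(pokemon$type_1))),
    plotOutput("scatter")
)

server <- function(input, output) {
    output$scatter <- renderPlot({
        pokemon |>
            filter(type_1 == input$type) |>
            ggplot(aes(Attack, Defense, col = Legendary)) +
            geom_abline(col = "#d3d3d3") +
            geom_point()
    })
}

app <- shinyApp(ui, server)  # launch interactively with runApp(app)
```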

Gene Expression Faceting

Scoring

  • a, c - d, Code (2 points): Correct and concise (2 points), Correct but unnecessarily complex (1 point), Incorrect (0 points)
  • b, d, Discussion (1 point): Complete interpretation (1 point), Incomplete or unjustified interpretation (0.5 points), Incorrect (0 points)

Question

In this problem, we will experiment with several approaches to visualizing a dataset of gene expression over time. Each row below contains the expression level for one gene in one sample at a single timepoint.

genes <- read_csv("https://uwmadison.box.com/shared/static/dwzchdtfca33r0f6i055k2d0939onnlv.csv")
head(genes, 3)
## # A tibble: 3 × 4
##   sample gene   time value
##    <dbl> <chr> <dbl> <dbl>
## 1      1 Pyy   0.677   103
## 2      1 Iapp  0.677     0
## 3      1 Chgb  0.677    86
  a. Provide code to create the small multiples plot below. Note that the points have been made semi-transparent and that the \(y\)-axis is a transformation of the original value column.

The code is in the block below. We can create the small multiples using facet_wrap. Here, we accomplished the log transformation by applying log() within the aes() call. Alternatively, we could have created a transformed column in the original data frame and referred to this.

ggplot(genes) +
    geom_point(aes(time, log(1 + value)), alpha = 0.2, size = 0.8) +
    facet_wrap(~gene, nrow = 2) +
    labs(x = "Time", y = expression(log(1 + value)))
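The alternative with a precomputed column would look like the following (a minimal equivalent sketch):

```r
library(tidyverse)

# Equivalent approach: derive the transformed column first, then refer to it.
genes <- read_csv("https://uwmadison.box.com/shared/static/dwzchdtfca33r0f6i055k2d0939onnlv.csv")

genes |>
    mutate(log_value = log(1 + value)) |>
    ggplot() +
    geom_point(aes(time, log_value), alpha = 0.2, size = 0.8) +
    facet_wrap(~gene, nrow = 2) +
    labs(x = "Time", y = expression(log(1 + value)))
```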

  b. In your own words, describe one strength and one weakness of using small multiples.

Strengths:

  • Avoids overlap between similar elements.
  • Since each small multiple can be labeled, it’s possible to draw conclusions about individual elements.
  • Small multiples can be sorted to simplify some comparison (e.g., by trend).
  • Since the same encodings are used for each element, the effort of reading the overall plot is not too much higher than reading a panel on its own.

Weaknesses:

  • Can take up much more space than plots that overlay each element.
  • Comparing elements at specific \(x\) and \(y\) values across panels is challenging, because the reader has to remember relative locations within separate panels, rather than comparing elements directly.
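For contrast, the overlay alternative referenced in the weaknesses could be sketched as follows (one possible design, not part of the required solution):

```r
library(tidyverse)

# Overlay alternative to small multiples: one panel, gene encoded by color.
# This saves space but makes overlapping genes hard to distinguish.
genes <- read_csv("https://uwmadison.box.com/shared/static/dwzchdtfca33r0f6i055k2d0939onnlv.csv")

ggplot(genes) +
    geom_point(aes(time, log(1 + value), col = gene), alpha = 0.2, size = 0.8) +
    labs(x = "Time", y = expression(log(1 + value)), col = "Gene")
```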
  c. Suppose we instead wanted a heatmap of expression values with genes along rows and timepoints along columns. Provide code to draw this, starting from the gene_groups dataset defined below. Ensure that genes are sorted from most to least abundant, and also that expression values are shown on a \(\log\left(1 + x\right)\) scale.
gene_groups <- genes |>
    group_by(gene, rounded_time = round(time, 2)) |>
    summarise(mean_value = mean(value))

head(gene_groups, 3)
## # A tibble: 3 × 3
## # Groups:   gene [1]
##   gene  rounded_time mean_value
##   <chr>        <dbl>      <dbl>
## 1 Cck           0        0     
## 2 Cck           0.01     0     
## 3 Cck           0.02     0.0652

The gene-by-time tiles are the basic elements of this figure, and they correspond to the rows of the gene_groups data.frame. Therefore, it is sufficient to add geom_tile with appropriate encodings for time, gene ID, and expression level. The scale and label layers were not needed for full credit, but we do expect the genes to be reordered and the fill to be on a log scale.

ggplot(gene_groups) +
    geom_tile(aes(rounded_time, reorder(gene, mean_value), fill = log(1 + mean_value))) +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_discrete(expand = c(0, 0)) +
    labs(
        x = "Time (Rounded)",
        y = "Gene (Sorted by Expression)",
        fill = "log(1 + mean_value)"
    ) +
    theme(
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14)
    )
  d. Imagine that we trained a model that generates a smooth curve over time for each gene. It is stored in the fitted_values data below. How can you modify your solution to (a) to overlay this? Provide example code and explain how this relates to the phrase “grammar of graphics.”
fitted_values <- read_csv("https://uwmadison.box.com/shared/static/3cqvf386cr9or2mok2ttp6cl88mtanmg.csv")
head(fitted_values, 3)
## # A tibble: 3 × 4
##   gene   time     mu sigma
##   <chr> <dbl>  <dbl> <dbl>
## 1 Pyy   0.677  52.8   3.17
## 2 Pyy   0.432   2.47  1.23
## 3 Pyy   0.749 112.    1.95

We need to add a geom_line layer with its own data argument, since the rows in the original data and the fitted curve have different meanings. (In particular, we cannot simply join the two datasets.)

ggplot(genes, aes(time, log(1 + value))) +
    geom_point(alpha = 0.2, size = 0.8) +
    geom_line(data = fitted_values, aes(y = log(1 + mu)), linewidth = 2, col = "darkred") +
    facet_wrap(~gene, nrow = 2)

This is an example of the “grammar of graphics” because we have composed a new visualization by combining several more basic layers. We were able to compose a new type of plot – “scatterplot with overlaying line” – from simple elements, rather than needing a high-level wrapper function for this specific combination.

Visual Redesign

Solutions to this problem will vary. If you would like to discuss your specific visualization and redesign, please see the instructor. An example solution from a previous year is included below.

Scoring

  • a - b, Discussion (0.75 points): Accurate and complete analysis of visualization’s goals, using concepts introduced in class (0.75 points), Generally accurate, but potentially vague or poorly referenced, analysis (0.38 points), Little evidence of specific analysis (0 points)
  • c, Discussion (0.75 points): Critical and insightful analysis of past visualization’s limitations (0.75 points), Generally correct analysis but failing to observe important limitations (0.38 points), Imprecise or poorly elaborated analysis (0 points)
  • d, Design and Code (0.75 points): Substantive improvements in new design and elegant code (0.75 points), Appropriate design and readable code (0.38 points), Negligible changes in design or unreadable code (0 points)
  • d, Discussion (1.5 points): Benefits of new design are discussed clearly and refer to concepts from class (1.5 points), Benefits of design are discussed imprecisely (0.75 points), Missing discussion (0 points)

Question

In this exercise, you will find a visualization you have made in the past and redesign it using the skills you have learned in this course.

  a. Identify one of your past visualizations for which you still have data. Include a screenshot of this past visualization.
chocolate <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv")

Previously, I wanted to check whether different cocoa percentages lead to different ratings. So I did:

    library(tidyr)
    cocoa_percent <- extract_numeric(chocolate$cocoa_percent)
    boxplot(chocolate$rating ~ cocoa_percent)

  b. Comment on the main takeaways from the visualization and the graphical relationships that lead to that conclusion. Is this takeaway consistent with the intended message? Are there important comparisons that you would like to highlight, but which are harder to make in the current design?

Cocoa_percent is shown on the x-axis, and chocolate ratings are shown on the y-axis. Previously I concluded that the ratings are generally higher when cocoa_percent is between 60-80% (medium percent), because on the graph, the ‘boxes’ are ‘higher’ for medium cocoa percents and ‘lower’ when the percentage is over 80 or below 60. I think it is good to make a rating vs. percent boxplot to compare them; it roughly gives us a sense of how the rating changes across cocoa percents. But we can’t tell the size of each group: a boxplot only shows how the data points spread out but does not indicate which boxes have bigger samples.

  c. Comment on the legibility of the original visualization. Are there aspects of the visualization that are cluttered or difficult to read?

The x-axis does not show all the tick labels, but I think that is fine. However, the boxplot cannot clearly show the distribution of ratings.

  d. Propose and implement an alternative design. What visual tasks do you prioritize in the new design? Did you have to make any trade-offs? Did you make any changes specifically to improve legibility?
chocolate |>
    group_by(rating) |>
    count(cocoa_percent) |>
    ggplot() +
    geom_point(aes(extract_numeric(cocoa_percent), rating, size = n, col = rating)) +
    labs(x = "cocoa percent", y = "rating")

I changed the boxplot into a scatterplot, with the size of each dot indicating the number of samples in that category. In this plot, I want to stress the problem that a boxplot cannot show sample size, so we don’t know whether ratings are truly higher for medium cocoa percents or whether this just reflects a lack of data points at higher and lower percents. From the new graph, I figure we cannot conclude higher ratings for 60-80% chocolate anymore; there are barely any data points below 60% or above 85%, so a biased conclusion could be drawn from these data.

California Wildfire Alternatives

Scoring

  • a - d, Discussion (2.5 points): Complete and accurate (2.5 points), Moderately developed and mostly accurate (1.25 points), Insufficiently developed or broadly inaccurate (0 points)
  • d, Code (0.5 points): Correct and readable code (0.5 points), Either incorrect or unreadable code (0 points)

Question

Below, we provide three approaches to visualizing wildfire severity in the California fires dataset.

fires <- read_csv("https://uwmadison.box.com/shared/static/k5vvekf1bhh9e16qb9s66owygc70t7dm.csv") |>
    select(Name, Counties, year, day_of_year, AcresBurned, MajorIncident)
head(fires, 3)
## # A tibble: 3 × 6
##   Name        Counties   year day_of_year AcresBurned MajorIncident
##   <chr>       <chr>     <dbl>       <dbl>       <dbl> <lgl>        
## 1 Becks Fire  Lake       2013          22         296 FALSE        
## 2 River Fire  Inyo       2013          55         406 TRUE         
## 3 Jurupa Fire Riverside  2013          59         311 FALSE

For each approach, describe,

  • One type of visual comparison for which the visualization is well-suited.

  • One type of visual comparison for which the visualization is poorly-suited.

Make sure to explain your reasoning.

  a. Approach 1
ggplot(fires) +
    geom_point(aes(day_of_year, reorder(Counties, AcresBurned), size = log(AcresBurned)), alpha = 0.8) +
    facet_grid(. ~ year) +
    scale_size_continuous(range = c(0.1, 5)) +
    labs(
        x = "Day of Year",
        y = "County (Sorted by Acres Burned)",
        size = "log(AcresBurned)"
    ) +
    theme(
        axis.text.y = element_text(size = 6),
        axis.text.x = element_text(size = 8)
    )

This figure is effective for,

  • Observing broad changes in the total number of fires from year to year.
  • Identifying counties with especially high or low numbers of fires during a single year.
  • Recognizing which days of the year consistently have more or fewer fires, as well as fires that occur at unusual times of year.

This figure is ineffective for,

  • Learning the names of any particular fire of interest (e.g., any of the large ones).
  • Describing trends in the total number of acres burned, either across all or within specific counties.
  • Identifying the counties with the most acreage burned during any given year.
  b. Approach 2
ggplot(fires) +
    geom_boxplot(aes(factor(year), log(AcresBurned), fill = MajorIncident)) +
    scale_fill_brewer(palette = "Set2") +
    labs(fill = "Major Incident?", x = "Year") +
    theme(
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14)
    )

This figure is effective for,

  • Comparing the relative sizes of major and minor incident fires.
  • Describing key summary statistics (min, max, quartiles) of the fire size distribution within each year.
  • Highlighting that some fires that are not major incidents can still have many acres burned.
  • Identifying outlying years, with more or fewer acres burned than the neighboring years.

This figure is ineffective for,

  • Determining the number of fire incidents from year to year.
  • Summarizing geographic variation in the number and size of fires.
  c. Approach 3
fires |>
    slice_max(AcresBurned, n = 18) |>
    ggplot() +
    geom_col(aes(AcresBurned, reorder(Name, AcresBurned), fill = factor(year))) +
    scale_x_continuous(expand = c(0, 0, 0.1, 0)) +
    scale_fill_brewer(palette = "Set3") +
    labs(y = "Fire", x = "Acres Burned", title = "Fires with the Most Acres Burned", fill = "Year") +
    theme(
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 14)
    )

This figure is effective for,

  • Giving the names of the largest fires in the data.
  • Highlighting that, even among the largest fires in the dataset, a few stand out as having many more acres burned than the rest.
  • Summarizing whether certain years had an unusually high/low number of especially devastating fires.

This figure is ineffective for,

  • Summarizing geographic variation in the number and size of fires.
  • Determining the number of fire incidents from year to year.
  • Describing temporal trends in the sizes or numbers of fires.
  d. Provide the code that could be used to create one of the figures above. If the original data need to be transformed/reshaped, include code for this as well.

    The code for each example is included in each part.

Homelessness

Scoring

  • a, Discussion (1 point): Well-developed discussion of potential data sources, e.g., describing what each measurement represents and how features may have been collected (1 point), Technically sound but somewhat underdeveloped response (0.5 points), No response or response without justification (0 points)
  • b, Discussion (0.5 points): Correct identification of data types for all displayed features (0.5 points), Incomplete discussion (0.25 points), Incorrect discussion of data types (0 points)
  • c, Discussion (1 point): Correct understanding of vocabulary for graphical encodings, and a complete discussion for all components of the figure (1 point), Correct but incomplete discussion of encodings (0.5 points), No or incorrect discussion (0 points)
  • d, Discussion (0.5 points): Correct identification and justification of multi-view composition (0.5 points), Correct identification with inappropriate justification (0.25 points), Incorrect identification (0 points)

Question

Take a static screenshot from any of the visualizations in this article, and deconstruct its associated visual encodings.

  a. What do you think was the underlying data behind the current view? What were the rows, and what were the columns?

Each row of the original dataset corresponds to one traveler. The measured variables include (i) the origin city, (ii) the destination city, and (iii) the difference between median incomes in the origin and destination cities (notice that the annotations contradict each other: one refers to an average income, the other to a median). The counts within bins of income difference were then derived, though it is unclear what the sizes of these bins are.
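A hypothetical sketch of how such bin counts might be derived, using invented column names and toy values (the article’s underlying records are not available to us):

```r
library(tidyverse)

# Toy data standing in for the article's traveler-level records;
# all names and values here are invented for illustration.
travelers <- tibble(
    origin = c("Chico", "New York", "Chico"),
    destination = c("Seattle", "San Juan", "Seattle"),
    income_difference = c(12000, -8000, 14000)
)

# Count travelers within fixed-width bins of income difference.
travelers |>
    mutate(bin = cut(income_difference, breaks = seq(-20000, 20000, by = 5000))) |>
    count(bin)
```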

  b. What were the data types of each of the columns?

The origin and destination cities are categories with many levels. The difference in median income is a quantitative variable.

  c. What encodings were used? How are properties of marks on the page derived from the underlying abstract data?

Vertical height is used to encode the change in median income from source to destination. Color encodes whether the change is positive or negative. The heights of the smoothed histogram on the right encode the number of travelers within that income difference bin. The widths of the paths from the left hand side of the display to the histogram also encode the number of travelers within that bin.

  d. Is multiview composition being used? If so, how?

Yes, multiview composition is being used. The Sankey/flow diagram on the left and the histogram on the right are shown simultaneously, and they use a common \(y\)-axis. Nonetheless, each could stand alone as an independent figure.

Besides these analysis questions, it’s worth noting the work’s excellent use of

  • Annotation: The figure can be interpreted without reference to the main article. A few examples (Chico to Seattle and New York to San Juan) also provide useful points of reference.
  • Aesthetic freedom: In theory, the exact same information could be encoded using just the histogram on the right. However, the additional flows bring to mind the imagery of bussing that recurs throughout the piece. While the visual encoding is redundant, it is nonetheless effective within the context of the larger piece.